We investigate zero temperature Gibbs learning for two classes of unrealizable
rules which play an important role in practical applications of multilayer
neural networks with differentiable activation functions: classification problems
and noisy regression problems. Considering one step of replica symmetry breaking,
we surprisingly find that for sufficiently large training sets the stable state is
replica symmetric even though the target rule is unrealizable. Further, the
classification problem is shown to be formally equivalent to the noisy regression
problem.

Neural networks with differentiable activation functions play an important role in 
practical applications [p. Besides being used for regression, they are often applied to 
classification problems as well, since gradient based methods are available for training 
such networks. In both cases, given a training set of $P$ input/output pairs $(\xi^\mu, \theta^\mu)$,
$\xi^\mu \in \mathbb{R}^N$, $\theta^\mu \in \mathbb{R}$, one adapts the network with output $\sigma$ to minimize a cost function
which measures the deviation between $\sigma(\xi^\mu)$ and the target output $\theta^\mu$.

For the regression problem we shall assume that the target output is a function $\tau$
of the input, corrupted by additive noise, so $\theta^\mu := \tau(\xi^\mu) + \gamma^\mu$. The noise terms
$\gamma^\mu$ are independent and normally distributed, with zero mean and variance $\gamma^2$.
An appropriate cost function then is the quadratic error

$$H = \frac{1}{2}\sum_{\mu=1}^{P}\left(\sigma(\xi^\mu) - \theta^\mu\right)^2 . \qquad (1)$$

We call $\epsilon_t = H/P$, the mean energy per example, the training error. The main goal of learning,
however, is to minimize the prediction error $\epsilon_p$, defined as the expectation value of the
training error on a new example, that is $\epsilon_p = \langle(\sigma(\xi) - \theta(\xi))^2\rangle/2$, where the average
is performed over the distribution of inputs and the randomness of $\theta$ in the presence of
noise.
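As a worked step, the prediction error can be split into a noise-free part and a noise floor. The sketch below writes $\sigma$ for the network output and $\tau$ for the noise-free target function, and it assumes that the noise on the new example, called $\zeta$ here purely for illustration, has zero mean and variance $\gamma^2$ and is independent of the input.

```latex
\begin{align*}
\epsilon_p
  &= \tfrac{1}{2}\left\langle \left(\sigma(\xi) - \tau(\xi) - \zeta\right)^2 \right\rangle \\
  &= \tfrac{1}{2}\left\langle \left(\sigma(\xi) - \tau(\xi)\right)^2 \right\rangle
     - \left\langle \left(\sigma(\xi) - \tau(\xi)\right)\zeta \right\rangle
     + \tfrac{1}{2}\left\langle \zeta^2 \right\rangle \\
  &= \tfrac{1}{2}\left\langle \left(\sigma(\xi) - \tau(\xi)\right)^2 \right\rangle
     + \tfrac{\gamma^2}{2} .
\end{align*}
```

Under these assumptions the cross term vanishes, so even a student that reproduces $\tau$ exactly is left with $\epsilon_p = \gamma^2/2$.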

In classification problems only a binary label is available for the examples and we
shall assume that $\theta^\mu = \lambda\,\mathrm{sign}(\tau(\xi^\mu))$. Here $\tau$ is some function of the input and $\lambda$ is
a tunable parameter. One is then mainly interested in the sign of the network's output,
that is, the goal of learning is to minimize the classification error

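$$\epsilon_c = \left\langle \Theta(-\sigma(\xi)\,\tau(\xi)) \right\rangle , \qquad (2)$$

where $\Theta$ is the Heaviside step function. However, the empirical mean of this
performance measure

$$P^{-1}\sum_{\mu=1}^{P} \Theta(-\sigma(\xi^\mu)\,\tau(\xi^\mu)) \qquad (3)$$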

is piecewise constant and cannot be optimized using, e.g., backpropagation. While the
sample complexity of training multilayer networks based on (3) has been analysed in
[|2|, H, practical applications of neural networks [[]], ||] typically use the differentiable
cost function (1) even for classification tasks. So for the purposes of training,
classification is mapped onto regression, and the question arises how this affects the
generalization behaviour. (Alternative cost functions have been studied in the context
of online learning [7].)

Here we present a theoretical investigation of the two learning problems. We focus
on a simple two-layered student network which consists of $K$ hidden units with activation
function $g(x) = \mathrm{erf}(x/\sqrt{2})$ and $N$-dimensional weight vectors $\{J_i\}_{i=1}^{K}$, where
$J_i^2 = N$. The output unit is linear and has weights fixed to the value $1/\sqrt{K}$. Then the
output of this network, which is called a "soft-committee machine" [9], is

$$\sigma(\xi) = \frac{1}{\sqrt{K}}\sum_{i=1}^{K} g\!\left(\frac{J_i\cdot\xi}{\sqrt{N}}\right) . \qquad (4)$$

The target function $\tau(\xi)$ will be given by a soft-committee machine with the same
number of hidden units as the student network and weight vectors $\{B_i\}_{i=1}^{K}$, where
$B_i\cdot B_j = N\delta_{ij}$. So the classification problem is perfectly learnable in the sense that
the student network can achieve $\epsilon_c = 0$ if its weight vectors become identical to those
of the teacher network. Further, we assume the components of the examples to be
independent random numbers with mean zero and unit variance.
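The student/teacher setup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' procedure: the scaling of the hidden-unit fields by $1/\sqrt{N}$, the choice of teacher vectors along coordinate axes, the label amplitude `lam = 1.0`, and all function names are assumptions made here.

```python
import math
import random


def g(x):
    """Hidden-unit activation g(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))


def soft_committee(weights, xi):
    """Output of a soft-committee machine with K = len(weights) hidden units.

    Assumption: each hidden unit sees the field J_i . xi / sqrt(N).
    """
    n = len(xi)
    k = len(weights)
    fields = [sum(j * x for j, x in zip(w, xi)) / math.sqrt(n) for w in weights]
    return sum(g(h) for h in fields) / math.sqrt(k)


def training_error(weights, inputs, targets):
    """Mean quadratic error per example, (sigma - target)^2 / 2."""
    total = sum((soft_committee(weights, xi) - t) ** 2
                for xi, t in zip(inputs, targets))
    return total / (2.0 * len(inputs))


def empirical_classification_error(weights, teacher, inputs):
    """Fraction of inputs on which student and teacher outputs differ in sign."""
    wrong = sum(1 for xi in inputs
                if soft_committee(weights, xi) * soft_committee(teacher, xi) < 0.0)
    return wrong / len(inputs)


def orthogonal_teacher(k, n):
    """K mutually orthogonal vectors with squared norm N (requires K <= N)."""
    return [[math.sqrt(n) if col == row else 0.0 for col in range(n)]
            for row in range(k)]


rng = random.Random(0)
N, K, P = 20, 3, 200
teacher = orthogonal_teacher(K, N)
inputs = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(P)]
lam = 1.0  # illustrative label amplitude
labels = [math.copysign(lam, soft_committee(teacher, xi)) for xi in inputs]

# A student identical to the teacher classifies every example correctly ...
perfect = empirical_classification_error(teacher, teacher, inputs)
# ... but its quadratic error on the binary labels does not vanish.
residual = training_error(teacher, inputs, labels)
```

The last two lines illustrate the point made in the text: the classification task is realizable, while the same task posed as regression on binary labels leaves a residual quadratic error.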

We use the well-known replica formalism to investigate these problems in the thermodynamic
limit $N\to\infty$. This requires the calculation of the quenched free energy

$$F = -\frac{1}{\beta}\langle\ln Z\rangle = -\frac{1}{\beta}\left.\frac{\partial}{\partial n}\ln\langle Z^n\rangle\right|_{n=0} , \qquad (5)$$

where $Z^n$ is the partition function $\int d\mu(\{J_i^a\})\exp(-\beta\sum_{a=1}^{n} H(\{J_i^a\}_{i=1}^{K}))$ of $n$ replicas
(labeled $a, b = 1, 2, 3, \ldots$) of the student network [11, 12]. Here $H$ is interpreted
as the energy of a system which is in thermal equilibrium at a temperature $T = 1/\beta$.
In the limit of zero temperature, $\beta\to\infty$, $F$ is the optimal value of the energy which
can be achieved by minimizing $H$ with respect to the network weights.

Introducing an additional integration over the order parameters $Q_{ij}^{ab} := J_i^a\cdot J_j^b/N$
and $R_{ij}^{a} := J_i^a\cdot B_j/N$, which is performed as a saddle point integration in the limit of
large $N$, we find for moments of the partition function $\ln\langle Z^n\rangle = -N(\alpha K G_r + s)|_{\mathrm{extr}}$.
Here $G_r$ is an effective Hamiltonian and the entropy term is $s = (1/2)\ln\det C$, where
$C$ is the $K(n+1)\times K(n+1)$-dimensional matrix of the order parameters [10]. We
have further introduced the rescaled number of examples, $\alpha = P/(NK)$.

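In the following we restrict ourselves to the limit of large $K$. We make a site-symmetric
Ansatz for the dependence of the order parameters on the site indices $i, j$:

$$R_{ij}^{a} = \delta_{ij}\left(\frac{R^a}{K} + \Delta^a\right) + (1-\delta_{ij})\frac{R^a}{K} \qquad (6)$$

$$Q_{ij}^{ab} = \delta_{ij}\left(\frac{Q^{ab}}{K} + \delta^{ab}\right) + (1-\delta_{ij})\frac{Q^{ab}}{K} \qquad (7)$$

Here $\delta_{ij}$ is the Kronecker symbol, whereas $\Delta^a$ and $\delta^{ab}$ are order parameters.
The scaling of the unspecialized order parameters with the number of hidden units
results from the condition that the outputs of the students $\sigma^a$ must be of order 1 in the
limit of infinite $K$. In this limit the calculation of $G_r$ can be carried out analytically
since the joint distribution of the $\sigma^a$ and $\tau$ becomes Gaussian [Q].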
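The second moments of such a Gaussian distribution are built from pairwise averages of hidden-unit activations. For $g(x) = \mathrm{erf}(x/\sqrt{2})$ and two unit-variance Gaussian fields with correlation $c$, the standard identity is $\langle g(x)g(y)\rangle = (2/\pi)\arcsin(c/2)$. The following Monte Carlo sketch checks that identity numerically; the sample size, seed, and function names are choices made here for illustration.

```python
import math
import random


def g(x):
    """Hidden-unit activation g(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))


def mc_activation_covariance(c, samples=100_000, seed=0):
    """Monte Carlo estimate of <g(x) g(y)> for unit-variance fields with correlation c."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - c * c)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)
        y = c * x + s * rng.gauss(0.0, 1.0)  # y has unit variance and <x y> = c
        total += g(x) * g(y)
    return total / samples


def exact_activation_covariance(c):
    """Closed form (2 / pi) * arcsin(c / 2) for the same average."""
    return (2.0 / math.pi) * math.asin(c / 2.0)
```

For $c = 1$ the closed form gives $1/3$, which is the variance of a single hidden-unit activation.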

For the regression problem, using a one step replica symmetry breaking Ansatz,
we obtain in the limit $n\to 0$:

$$G_r^0 = \frac{1}{2}\left(\frac{X_1}{X_2} + \frac{m-1}{m}\ln X_3 + \frac{1}{m}\ln X_2\right) , \qquad (8)$$

where

$$X_1 = \beta(v^0 - 2w + 1/3 + \gamma^2)$$

$$X_2 = 1 + \beta(u + (m-1)v^1 - m v^0)$$

$$X_3 = 1 + \beta(u - v^1) .$$

$u = 1/3 + Q/\pi$, $v^1 = f(\delta^1, Q^1)$, $v^0 = f(\delta^0, Q^0)$, and $w = f(\Delta, R)$ are the covariances
of the $\sigma^a$ and $\tau$, where

$$f(x, y) = \frac{2}{\pi}\arcsin\left(\frac{x}{2}\right) + \frac{y}{\pi} .$$
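To make the structure of the effective Hamiltonian concrete, the sketch below evaluates it numerically. It assumes the one-step form $G_r^0 = \frac{1}{2}\left(X_1/X_2 + \frac{m-1}{m}\ln X_3 + \frac{1}{m}\ln X_2\right)$ and the covariance function $f(x,y) = (2/\pi)\arcsin(x/2) + y/\pi$; argument names such as `d0`, `d1` (for $\delta^0$, $\delta^1$) are hypothetical labels chosen here.

```python
import math


def f(x, y):
    """Covariance function f(x, y) = (2 / pi) * arcsin(x / 2) + y / pi."""
    return (2.0 / math.pi) * math.asin(x / 2.0) + y / math.pi


def g_r0(beta, m, gamma, q, q0, q1, d0, d1, delta, r):
    """Effective Hamiltonian of the regression problem in the one-step RSB Ansatz."""
    u = 1.0 / 3.0 + q / math.pi   # variance of a student output
    v1 = f(d1, q1)                # covariance of students within a cluster
    v0 = f(d0, q0)                # covariance of students in different clusters
    w = f(delta, r)               # covariance of student and teacher outputs
    x1 = beta * (v0 - 2.0 * w + 1.0 / 3.0 + gamma ** 2)
    x2 = 1.0 + beta * (u + (m - 1.0) * v1 - m * v0)
    x3 = 1.0 + beta * (u - v1)
    return 0.5 * (x1 / x2 + (m - 1.0) / m * math.log(x3) + math.log(x2) / m)


# Replica symmetric limit: with d0 == d1 and q0 == q1 the result no longer depends on m.
rs_a = g_r0(beta=2.0, m=0.3, gamma=0.5, q=0.2, q0=0.1, q1=0.1,
            d0=0.5, d1=0.5, delta=0.4, r=0.1)
rs_b = g_r0(beta=2.0, m=0.7, gamma=0.5, q=0.2, q0=0.1, q1=0.1,
            d0=0.5, d1=0.5, delta=0.4, r=0.1)
```

When the two overlap scales coincide, $X_2 = X_3$ and the two logarithms combine, so the breaking parameter $m$ drops out. This is a quick consistency check on the form assumed above.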

It is easy to calculate $s$ and to perform the limit $n\to 0$ to obtain the entropy term
$s^0$ in the free energy, which is the same as for hard committee machines [|]]. The
order parameter $\Delta$ indicates specialization of the network: if $\Delta = 0$, the network
configuration is unspecialized, i.e. a weight vector of the student network has the same
overlap ($R/K$) with all weight vectors of the teacher network, whereas a positive $\Delta$
indicates a specialized configuration where each of the student vectors has a greater
overlap ($R/K + \Delta$) with one of the teacher vectors than with the others. $Q/K$ is
the cross-overlap between different weight vectors of a student. The remaining order
parameters $Q^0$, $Q^1$, $\delta^0$, $\delta^1$ and $m$ parametrize the distribution of overlaps between
the weight vectors of different students. Note that as in [11], using the saddle point
equations for the free energy, one may analytically eliminate the unspecialized order
parameters $R$, $Q$, $Q^0$ and $Q^1$.

In terms of the order parameters the prediction error for the regression problem is
given by:

$$\epsilon_p = \frac{1}{3} + \frac{Q}{2\pi} - \frac{R}{\pi} - \frac{2}{\pi}\arcsin\left(\frac{\Delta}{2}\right) + \frac{\gamma^2}{2} . \qquad (9)$$
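As a quick numerical check of the regression prediction error, the sketch below evaluates $\epsilon_p = 1/3 + Q/(2\pi) - R/\pi - (2/\pi)\arcsin(\Delta/2) + \gamma^2/2$ at two reference points. The unspecialized values $Q = 0$, $R = 1$ are an assumption of this sketch: they minimize the expression at $\Delta = 0$ under the Cauchy-Schwarz bound $Q \ge (\Delta + R)^2 - 1$ on the order parameters. The noise level `gamma = 0.5` is arbitrary.

```python
import math


def regression_prediction_error(q, r, delta, gamma):
    """Prediction error of the noisy regression problem in terms of order parameters."""
    return (1.0 / 3.0 + q / (2.0 * math.pi) - r / math.pi
            - (2.0 / math.pi) * math.asin(delta / 2.0) + gamma ** 2 / 2.0)


gamma = 0.5  # illustrative noise level

# Student identical to the teacher: only the noise floor gamma^2 / 2 remains.
specialized = regression_prediction_error(q=0.0, r=0.0, delta=1.0, gamma=gamma)

# Unspecialized reference point (assumed values Q = 0, R = 1, Delta = 0).
unspecialized = regression_prediction_error(q=0.0, r=1.0, delta=0.0, gamma=gamma)
```

The first value reduces to $\gamma^2/2$ because $(2/\pi)\arcsin(1/2) = 1/3$; the second equals $1/3 - 1/\pi + \gamma^2/2$.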



The replica calculation for the classification problem is analogous. It yields the
same entropy $s^0$ and a $G_r^0$ of the form (8) with identical $X_2$ and $X_3$ but

$$X_1 = \beta\left(v^0 - 2w\lambda\sqrt{6/\pi} + \lambda^2\right) . \qquad (10)$$

For the prediction and classification error one finds

$$\epsilon_p = \frac{1}{6} + \frac{Q}{2\pi} - \lambda\sqrt{\frac{6}{\pi}}\left(\frac{2}{\pi}\arcsin\left(\frac{\Delta}{2}\right) + \frac{R}{\pi}\right) + \frac{\lambda^2}{2} \qquad (11)$$



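$$\epsilon_c = \frac{1}{\pi}\arccos\left(\frac{\frac{2}{\pi}\arcsin\left(\frac{\Delta}{2}\right) + \frac{R}{\pi}}{\sqrt{\frac{1}{3}\left(\frac{1}{3} + \frac{Q}{\pi}\right)}}\right) . \qquad (12)$$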



In the limit of large sample size $P$ the training error $\epsilon_t$ will converge to $\epsilon_p$. So for the
classification error to become zero the value of $\lambda$ must be chosen so that the minima of
$\epsilon_p$ and $\epsilon_c$ coincide. Note that the order parameters are constrained by the fact that the
vectors $a := (1/N)\sum_{i=1}^{K} J_i$ and $b := (1/N)\sum_{j=1}^{K} B_j$ must fulfill $(a\cdot b)^2 \le a^2 b^2$, which demands
$Q \ge (\Delta + R)^2 - 1$. Minimizing the prediction error (11) under this restriction, we
obtain $\Delta = 1$ and $Q = R = 0$ (student and teacher network identical, $\epsilon_c = 0$) only for
$\lambda = \lambda_0 = \sqrt{\pi/6}$. This is the optimal value of $\lambda$ which allows asymptotically perfect
classification. Inserting $\lambda = \lambda_0$ in equation (10) and comparing to (8), one finds that
in this case the free energy of the classification problem is identical to that of a noisy
regression problem with $\gamma = \gamma_0 = \sqrt{\pi/6 - 1/3}$. In the sequel we shall only consider
the case $\lambda = \lambda_0$ for the classification problem.
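The constrained minimization described above can be checked with a coarse grid search. The sketch assumes the forms $\epsilon_p = 1/6 + Q/(2\pi) - \lambda\sqrt{6/\pi}\,w + \lambda^2/2$ and $\epsilon_c = \pi^{-1}\arccos\big(w/\sqrt{(1/3)(1/3 + Q/\pi)}\big)$ with $w = (2/\pi)\arcsin(\Delta/2) + R/\pi$, takes $Q$ at its lower bound $(\Delta + R)^2 - 1$, and restricts the search to $0 \le \Delta \le 1$, $-1 \le R \le 1$. The grid, the search ranges, and all names are choices made for this illustration.

```python
import math

LAMBDA_0 = math.sqrt(math.pi / 6.0)
GAMMA_0 = math.sqrt(math.pi / 6.0 - 1.0 / 3.0)


def teacher_overlap(delta, r):
    """Covariance w of student and teacher outputs."""
    return (2.0 / math.pi) * math.asin(delta / 2.0) + r / math.pi


def prediction_error(q, r, delta, lam):
    """Quadratic prediction error for labels lam * sign(tau)."""
    return (1.0 / 6.0 + q / (2.0 * math.pi)
            - lam * math.sqrt(6.0 / math.pi) * teacher_overlap(delta, r)
            + lam ** 2 / 2.0)


def classification_error(q, r, delta):
    """Probability of a sign disagreement for jointly Gaussian outputs."""
    rho = teacher_overlap(delta, r) / math.sqrt((1.0 / 3.0) * (1.0 / 3.0 + q / math.pi))
    rho = max(-1.0, min(1.0, rho))  # guard against rounding just outside [-1, 1]
    return math.acos(rho) / math.pi


def constrained_minimum(lam, steps=20):
    """Grid search for the minimum of the prediction error with Q at its lower bound."""
    best = None
    for i in range(steps + 1):
        delta = i / steps              # assumed range 0 <= delta <= 1
        for j in range(-steps, steps + 1):
            r = j / steps              # assumed range -1 <= r <= 1
            q = (delta + r) ** 2 - 1.0  # smallest q allowed by the constraint
            value = prediction_error(q, r, delta, lam)
            if best is None or value < best[0]:
                best = (value, delta, r)
    return best


value, delta, r = constrained_minimum(LAMBDA_0)
```

On this grid the minimum for $\lambda = \lambda_0$ sits at $\Delta = 1$, $R = 0$, where the prediction error equals $\gamma_0^2/2 = \pi/12 - 1/6$ and the classification error vanishes. That is the sense in which the classification problem at $\lambda_0$ matches a regression problem with noise level $\gamma_0$.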

We focus on the limit of zero temperature. The construction of this limit depends
on whether a zero training error is achievable. Denoting by $\alpha_c(\gamma)$ the critical capacity
up to which this is the case, we find that $\alpha_c(\gamma)$ decreases to 0 with increasing $\gamma$ and
$\alpha_c(\gamma)\to 1$ as $\gamma\to 0$. This is explained by the fact that the noise increases the
magnitude of the target outputs. This correlates the hidden units of the student and
thus reduces the storage capacity.

Below $\alpha_c(\gamma)$ we find an unspecialized replica symmetric solution with $\Delta = \delta^1 =
\delta^0 = 0$. Above $\alpha_c(\gamma)$ one finds $\delta^1\to 1$ for $\beta\to\infty$ and the appropriate scaling is
$1 - \delta^1 = \eta/\beta$ where $\eta$ is $O(1)$. To achieve nontrivial results $m$ must also be scaled
with $\beta$ and we reparametrize $m = \tilde m/\beta$. Then for $\alpha > \alpha_c(\gamma)$ the zero temperature
free energy functional is given by:



where $k = (2\sqrt{3} - 3)/(\pi - 3)$ and $z(x) = (-3/\pi)(x - 2\arcsin(x/2))$. The replica
symmetric case may be recovered by either taking the limit $\tilde m\to 0$ or the limit $\delta^0\to 1$.

These equations still admit an at least metastable unspecialized solution with $\Delta = 0$
for all $\alpha > \alpha_c(\gamma)$. But now replica symmetry is broken in this solution, and this
also holds in the noiseless case $\gamma = 0$. Above a second critical $\alpha$ the stable solution
is specialized ($\Delta > 0$) and remarkably even in the noisy case this specialized solution
does not exhibit replica symmetry breaking.

The lifting of RSB with the onset of specialization is illustrated in Figure 1 for
$\gamma = \gamma_0$. Fixing $\Delta$ and maximizing (13) w.r.t. the remaining order parameters
corresponds to calculating the free energy of a system with a state space constrained
to vectors yielding a specialized student/teacher overlap of $\Delta$. At the maximum, $F/P$
is the training error of the constrained system shown in Figure 1.

The physically relevant states in the case of training with unconstrained $\Delta$ are
given by the minima of these curves. Both in the RS and the RSB parametrizations
we find a local minimum at $\Delta = 0$, which corresponds to a metastable unspecialized
configuration of the system. Here the RSB solution yields a greater free energy than
the RS solution and therefore is the only physically relevant solution.

With increasing $\Delta$ both curves approach each other, and the RSB and RS solutions
merge at $\Delta\approx 0.78$, i.e. for sufficiently large $\Delta$ there is no replica symmetry breaking.

There is a second minimum of the free energy at $\Delta\approx 0.87$ which corresponds to
a replica symmetric specialized phase of the learning with unconstrained $\Delta$. It
yields a lower free energy than the unspecialized solution and therefore is the globally
stable configuration.

In general, we find the following scenario, which is illustrated in the right panel
of Figure 1 for $\gamma = \gamma_0$. For all values of $\alpha$ there is an unspecialized solution with
constant prediction error ($\epsilon_p = 1/3 - 1/\pi + \gamma^2/2$). Replica symmetry is broken in
this solution for $\alpha > \alpha_c(\gamma)$. Beyond a second critical $\alpha$ the unspecialized solution
is only metastable and the stable solution is specialized and replica symmetric. In the
noiseless case, the two critical values of $\alpha$ coincide, and thus replica symmetry is never
broken in the stable state. In the noisy case, the prediction error decays as $1/\alpha$ to its
asymptotic value $\gamma^2/2$ in the specialized phase.




(13) 



•n + m(l — 5°) rh f) 
Figure 1: Results for the classification problem and for the noisy regression problem
with $\gamma = \gamma_0$. Left panel: $\epsilon_t(\Delta)$ for $\alpha = 25$. Dashed line: replica symmetric solution,
solid line: one-step replica symmetry breaking solution. Right panel: $\epsilon_t(\alpha)$. At
$\alpha_c(\gamma_0)\approx 0.3$ replica symmetry is broken. At $\alpha\approx 21.5$ replica symmetry is restored
with the onset of specialization. The dashed line shows the (wrong) results of a replica
symmetric calculation.



For the classification problem, $\gamma = \gamma_0$, the $1/\alpha$ decay of the prediction error translates
into the following asymptotics of the classification error:

<c ~v5I[_L. (14) 

7T4 V fl 

This slow decay of $\epsilon_c$ reflects the cost of treating the classification problem as
a regression problem and thus mapping a realizable case onto an unrealizable one.
Based on the results of [Q] one would expect a $1/\alpha$ asymptotics of the classification
error if the hard cost function (3) were used instead of the quadratic deviation (1).
Thus future research into batch learning should investigate the use of cost functions
like the ones proposed in [7] for the online scenario.

For the general case of noisy regression, it is remarkable that replica symmetry
breaking is only a transient phenomenon in that the specialized state which is the stable
one for large $\alpha$ is replica symmetric even in this unrealizable scenario.